Modeling Student Success

Presented by Team L02G06

Niharika, Yutong, Zichun, Yizhen, and Hannah

What’s Behind the Numbers?



  • The data comes from two Portuguese secondary schools (2014), with one dataset per subject:
    • Math
    • Portuguese
  • Both datasets consist of 32 features (19 numerical and 13 categorical).
  • The target variable for prediction is G3, the final grade, ranging from 0 to 20.
  • No missing values were detected in either dataset.

  • Goal: Identify which factors affect student grades and how students and educators can use them to achieve higher grades.
  • Categorical data column names:
    • school, sex, address, famsize, Pstatus, Mjob, Fjob,
    • reason, guardian, schoolsup, famsup, paid, activities,
    • nursery, higher, internet, romantic
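The missing-value check and the numerical/categorical split described above could be reproduced roughly as follows. This is a minimal sketch, not the team's actual code: the semicolon delimiter, the "NA" marker, and the three sample rows are assumptions made up for illustration (one cell is deliberately left blank so the check has something to find).

```python
import csv
import io

MISSING = {"", "NA"}  # assumed markers for a missing cell


def is_missing(value):
    """True if a cell is absent or holds one of the assumed missing markers."""
    return value is None or value.strip() in MISSING


def is_number(value):
    """True if a cell parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def summarize(rows):
    """Count missing cells and split columns into numeric vs. categorical."""
    columns = list(rows[0].keys())
    missing = sum(is_missing(r[c]) for r in rows for c in columns)
    numeric = [
        c for c in columns
        if all(is_number(r[c]) for r in rows if not is_missing(r[c]))
    ]
    categorical = [c for c in columns if c not in numeric]
    return missing, numeric, categorical


# Hypothetical sample, NOT taken from the real datasets.
SAMPLE = """school;sex;age;G1;G2;G3
GP;F;18;5;6;6
GP;M;17;14;15;15
MS;F;16;10;;11
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter=";"))
missing, numeric, categorical = summarize(rows)
print(missing, numeric, categorical)
```

On the real files, a result of zero missing cells would match the "no missing values" finding above.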

Math Data Set Distribution

Math Data distribution of each feature

Portuguese Data Set Distribution

Portuguese Data distribution of each feature

Math Grades (G3): Predicting with Log(G3)


Portuguese Scores: Breaking It Down


Insights: Backward vs. Stepwise

Math Backward vs Stepwise

Portuguese Backward vs Stepwise
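For readers unfamiliar with backward selection, the idea can be sketched as below. This is an illustration under stated assumptions, not the procedure used for these slides: it assumes AIC as the selection criterion (the slides do not say which criterion was used), uses NumPy for the least-squares fit, and runs on synthetic data in which only x0 and x1 truly drive the response.

```python
import numpy as np


def aic_ols(X, y):
    """AIC (up to an additive constant) of an OLS fit with an intercept."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ beta) ** 2))
    k = design.shape[1]  # number of fitted coefficients
    return n * np.log(rss / n) + 2 * k


def backward_eliminate(X, y, names):
    """Start from all predictors; drop one at a time while AIC improves."""
    kept = list(range(X.shape[1]))
    best = aic_ols(X[:, kept], y)
    while kept:
        # Score every model that removes exactly one remaining predictor.
        trials = [
            (aic_ols(X[:, [c for c in kept if c != j]], y), j) for j in kept
        ]
        score, drop = min(trials)
        if score >= best:
            break  # no single removal lowers AIC, so stop
        best = score
        kept.remove(drop)
    return [names[j] for j in kept]


# Synthetic data (made up): y depends on x0 and x1, while x2 is pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

selected = backward_eliminate(X, y, ["x0", "x1", "x2"])
print(selected)
```

Stepwise selection differs in that, after each removal, it also considers adding back a previously dropped predictor.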

Linearity Check: Straight Talk on Assumptions

Math Boxplot

Portuguese Boxplot

Homoscedasticity: Testing the Evenness

Math Boxplot

Portuguese Boxplot

Normality Check: Are We on Track?

Math Boxplot

Portuguese Boxplot

Math’s Top Model: The Winning Equation

log(G3) = 1.3968626 + 0.0274458⋅addressU - 0.0120656⋅Medu + 0.0359835⋅Mjobhealth - 0.0109299⋅Mjobother + 0.0015767⋅Mjobservices + 0.0313414⋅Mjobteacher + 0.0131041⋅traveltime - 0.0261929⋅nurseryyes + 0.0140297⋅famrel - 0.0135364⋅goout - 0.0009832⋅absences + 0.0081809⋅G1 + 0.0806517⋅G2
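To see how the equation turns into a grade, here is a small worked sketch. The coefficients are copied from the equation above; everything else is an assumption: the log is taken to be the natural log with no offset, and the student profile is hypothetical.

```python
import math

# Coefficients copied from the Math equation above.
MATH_COEFS = {
    "addressU": 0.0274458,
    "Medu": -0.0120656,
    "Mjobhealth": 0.0359835,
    "Mjobother": -0.0109299,
    "Mjobservices": 0.0015767,
    "Mjobteacher": 0.0313414,
    "traveltime": 0.0131041,
    "nurseryyes": -0.0261929,
    "famrel": 0.0140297,
    "goout": -0.0135364,
    "absences": -0.0009832,
    "G1": 0.0081809,
    "G2": 0.0806517,
}
MATH_INTERCEPT = 1.3968626


def predict_log_g3(student):
    """Linear predictor on the log scale; features not given count as 0."""
    return MATH_INTERCEPT + sum(
        coef * student.get(name, 0) for name, coef in MATH_COEFS.items()
    )


# Hypothetical student (made up): urban address, mother's job "other",
# attended nursery, G1 = 10, G2 = 11.
student = {
    "addressU": 1, "Medu": 2, "Mjobother": 1, "traveltime": 1,
    "nurseryyes": 1, "famrel": 4, "goout": 3, "absences": 4,
    "G1": 10, "G2": 11,
}

log_g3 = predict_log_g3(student)
predicted_g3 = math.exp(log_g3)  # back-transform, assuming natural log
print(round(log_g3, 4), round(predicted_g3, 2))
```

For this made-up profile the predicted final grade comes out at about 10.6, close to the student's G2 of 11, which reflects how heavily the model leans on previous grades.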

Math Insights: What We Learned

log(G3) = 1.3968626 + 0.0274458⋅addressU - 0.0120656⋅Medu + 0.0359835⋅Mjobhealth - 0.0109299⋅Mjobother + 0.0015767⋅Mjobservices + 0.0313414⋅Mjobteacher + 0.0131041⋅traveltime - 0.0261929⋅nurseryyes + 0.0140297⋅famrel - 0.0135364⋅goout - 0.0009832⋅absences + 0.0081809⋅G1 + 0.0806517⋅G2

Portuguese’s Best Fit: The Top Predictor

log(G3) = 1.373795 + 0.012734⋅age + 0.014982⋅traveltime - 0.042381⋅failures + 0.033537⋅higheryes - 0.008037⋅goout - 0.016241⋅Dalc + 0.016684⋅G1 + 0.059568⋅G2
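Because the response is log(G3), each coefficient acts multiplicatively on the grade: a one-unit increase in a predictor changes the predicted G3 by (exp(β) - 1) × 100 percent, holding the others fixed. The sketch below applies this to the Portuguese coefficients above, assuming the log is the natural log (the slides do not state the base).

```python
import math

# Coefficients copied from the Portuguese equation above.
PORTUGUESE_COEFS = {
    "age": 0.012734,
    "traveltime": 0.014982,
    "failures": -0.042381,
    "higheryes": 0.033537,
    "goout": -0.008037,
    "Dalc": -0.016241,
    "G1": 0.016684,
    "G2": 0.059568,
}


def percent_effect(beta):
    """Percent change in predicted G3 per one-unit increase in a predictor."""
    return (math.exp(beta) - 1.0) * 100.0


effects = {name: percent_effect(b) for name, b in PORTUGUESE_COEFS.items()}

# Print the largest effects first.
for name, pct in sorted(effects.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:>10}: {pct:+.2f}%")
```

Under that assumption, one extra point in G2 corresponds to roughly a 6% higher predicted final grade, while each past failure corresponds to roughly a 4% lower one.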

Head-to-Head: Math vs. Portuguese


Factor        | Math Student                                 | Portuguese Student
Positive      | Family Relationships, Living in Urban Areas  | Age, Travel Time
Negative      | Kindergarten Attendance, Socializing Outside | Failures, Daily Alcohol Consumption
Both Subjects | Previous Grades                              | Previous Grades

What Did We Miss?

Factor                         | Explanation
Reliance on Stepwise Selection | Statistical criteria over theory.
Multicollinearity Risks        | Key variables may be omitted.
Cultural/Environmental Factors | Lacks native speaker context.
Simplistic View                | Oversimplifies social factors.
Bias Risks                     | Increased risk of bias.

What’s Next? Future Research

  • Non-linear Models: Explore decision trees and random forests to capture more complex relationships.
  • Interactions: Examine how combined factors (e.g., social time and study time) affect performance.
  • Time-series Data: Gather data over time for deeper insight into academic performance.

Time Series Data

From Data to Action: Real-World Impact

  • Early Identification: Use G1 and G2 to identify at-risk students early.
  • Reduce Absences: Encourage attendance and better study habits to improve grades.
  • Health & Well-being: Address through support programs to improve academic performance.
